
FAIREDU: A Multiple Regression-Based Method for Enhancing Fairness in Machine Learning Models for Educational Applications

Pham, Nga, Do, Minh Kha, Dai, Tran Vu, Hung, Pham Ngoc, Nguyen-Duc, Anh

arXiv.org Artificial Intelligence

Fairness in artificial intelligence and machine learning (AI/ML) models is becoming critically important, especially as decisions made by these systems impact diverse groups. In education, a vital sector for all countries, the widespread application of AI/ML systems raises specific concerns regarding fairness. Current research predominantly focuses on fairness for individual sensitive features, which limits the comprehensiveness of fairness assessments. This paper introduces FAIREDU, a novel and effective method designed to improve fairness across multiple sensitive features. Through extensive experiments, we evaluate FAIREDU's effectiveness in enhancing fairness without compromising model performance. The results demonstrate that FAIREDU addresses intersectionality across features such as gender, race, and age, outperforming state-of-the-art methods with minimal effect on model accuracy. The paper also explores potential future research directions to further enhance the method's robustness and its applicability to various machine-learning models and datasets.
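The abstract does not spell out the algorithm, but the general idea behind regression-based fairness preprocessing can be sketched as follows: regress each feature on the sensitive attributes and keep only the residuals, so the resulting features are linearly uncorrelated with the sensitive attributes. This is a minimal illustration of that family of techniques, not FAIREDU itself; the data and function names are hypothetical.

```python
import numpy as np

def residualize(X, S):
    """Remove the linear influence of sensitive attributes S from features X.

    Regresses each column of X on S (with an intercept term) via multiple
    regression and returns the residuals, which are linearly uncorrelated
    with the columns of S.
    """
    S1 = np.column_stack([np.ones(len(S)), S])     # design matrix with intercept
    coef, *_ = np.linalg.lstsq(S1, X, rcond=None)  # least-squares fit
    return X - S1 @ coef                           # residuals = debiased features

# Toy data: one feature strongly correlated with a binary sensitive attribute.
rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=200).astype(float)     # sensitive attribute
x = 2.0 * s + rng.normal(size=200)                 # biased feature
X_fair = residualize(x.reshape(-1, 1), s.reshape(-1, 1))

# After residualizing, the correlation with s is numerically zero.
print(abs(float(np.corrcoef(X_fair[:, 0], s)[0, 1])) < 1e-8)  # True
```

Residuals of a least-squares fit are orthogonal to every column of the design matrix, which is why the debiased feature ends up uncorrelated with the sensitive attribute.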


Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness

Wang, Wenxuan

arXiv.org Artificial Intelligence

Large language models (LLMs), such as ChatGPT, have rapidly penetrated people's work and daily lives over the past few years, due to their extraordinary conversational skills and intelligence. ChatGPT has become the fastest-growing software in terms of user numbers in human history and an important foundational model for the next generation of artificial intelligence applications. However, the outputs of LLMs are not entirely reliable, often containing factual errors, biases, and toxicity. Given their vast number of users and wide range of application scenarios, these unreliable responses can lead to many serious negative impacts. This thesis introduces the exploratory works in the field of language model reliability conducted during the PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, to evaluate factual knowledge and logical reasoning accuracy, respectively. Second, for the non-toxicity of LLMs, we introduce two works for red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, to measure the social bias and cultural bias of LLMs, respectively.


Automated Program Repair: Emerging trends pose and expose problems for benchmarks

Renzullo, Joseph, Reiter, Pemma, Weimer, Westley, Forrest, Stephanie

arXiv.org Artificial Intelligence

A variety of techniques have been developed, e.g., evolutionary computation [60, 133], methods incorporating templated mutation operators [71], semantic inference techniques [79] targeting single-cause defects, and methods designed to handle multi-hunk bugs [100]. Increasingly, researchers have applied ML-based methods to APR tasks (Section 3), but data leakage is a concern (Section 4). Each new technique, or modification of an existing technique, tends to be developed by an independent research team, without reference to a common, formal definition of APR. Benchmarks are not enough to standardize evaluation on their own (Section 5). As motivating examples, consider the following inconsistencies in the published literature: Correctness. VFix [123] identifies correct patches that pass all test cases and are semantically or syntactically equivalent to the original bug-fix, while VRepair [26] reports repair accuracy in terms of semantic equivalence to the original bug-fix, and SynFix [10] defines correctness simply as passing the test cases. Each of these is a reasonable definition, but collectively, their differences make it difficult to compare results.


A Systematic Literature Review on Explainability for Machine/Deep Learning-based Software Engineering Research

Cao, Sicong, Sun, Xiaobing, Widyasari, Ratnadira, Lo, David, Wu, Xiaoxue, Bo, Lili, Zhang, Jiale, Li, Bin, Liu, Wei, Wu, Di, Chen, Yixin

arXiv.org Artificial Intelligence

The remarkable achievements of Artificial Intelligence (AI) algorithms, particularly in Machine Learning (ML) and Deep Learning (DL), have fueled their extensive deployment across multiple sectors, including Software Engineering (SE). However, due to their black-box nature, these promising AI-driven SE models are still far from being deployed in practice. This lack of explainability poses unwanted risks for their applications in critical tasks, such as vulnerability detection, where decision-making transparency is of paramount importance. This paper endeavors to elucidate this interdisciplinary domain by presenting a systematic literature review of approaches that aim to improve the explainability of AI models within the context of SE. The review canvasses work appearing in the most prominent SE & AI conferences and journals, and spans 63 papers across 21 unique SE tasks. Based on three key Research Questions (RQs), we aim to (1) summarize the SE tasks where XAI techniques have shown success to date; (2) classify and analyze different XAI techniques; and (3) investigate existing evaluation approaches. Based on our findings, we identified a set of challenges remaining to be addressed in existing studies, together with a roadmap highlighting potential opportunities we deemed appropriate and important for future work.


Large Language Models for Software Engineering: A Systematic Literature Review

Hou, Xinyi, Zhao, Yanjie, Liu, Yue, Yang, Zhou, Wang, Kailong, Li, Li, Luo, Xiapu, Lo, David, Grundy, John, Wang, Haoyu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE). Many recent publications have explored LLMs applied to various SE tasks. Nevertheless, a comprehensive understanding of the application, effects, and possible limitations of LLMs on SE is still in its early stages. To bridge this gap, we conducted a systematic literature review on LLM4SE, with a particular focus on understanding how LLMs can be exploited to optimize processes and outcomes. We collect and analyze 229 research papers from 2017 to 2023 to answer four key research questions (RQs). In RQ1, we categorize the different LLMs that have been employed in SE tasks, characterizing their distinctive features and uses. In RQ2, we analyze the methods used in data collection, preprocessing, and application, highlighting the role of well-curated datasets for successful LLM4SE implementations. RQ3 investigates the strategies employed to optimize and evaluate the performance of LLMs in SE. Finally, RQ4 examines the specific SE tasks where LLMs have shown success to date, illustrating their practical contributions to the field. From the answers to these RQs, we discuss the current state of the art and trends, identify gaps in existing research, and flag promising areas for future study.


Enhanced Fairness Testing via Generating Effective Initial Individual Discriminatory Instances

Ma, Minghua, Tian, Zhao, Hort, Max, Sarro, Federica, Zhang, Hongyu, Lin, Qingwei, Zhang, Dongmei

arXiv.org Artificial Intelligence

Fairness testing aims at mitigating unintended discrimination in the decision-making process of data-driven AI systems. Individual discrimination may occur when an AI model makes different decisions for two distinct individuals who are distinguishable solely according to protected attributes, such as age and race. Such instances reveal biased AI behaviour and are called Individual Discriminatory Instances (IDIs). In this paper, we propose an approach for the selection of the initial seeds used to generate IDIs for fairness testing. Previous studies mainly used random initial seeds to this end. However, this phase is crucial, as these seeds are the basis of the follow-up IDI generation. We dubbed our proposed seed selection approach I&D. It generates a large number of initial IDIs exhibiting great diversity, aiming at improving the overall performance of fairness testing. Our empirical study reveals that I&D is able to produce a larger number of IDIs than four state-of-the-art seed generation approaches, generating 1.68X more IDIs on average. Moreover, we compare the use of I&D to train machine learning models and find that using I&D reduces the number of remaining IDIs by 29% when compared to the state-of-the-art, thus indicating that I&D is effective for improving model fairness.
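The IDI definition above (two inputs differing only in a protected attribute receiving different decisions) can be sketched as a simple check. This is only an illustration of the definition, not the I&D seed-selection algorithm; the model and attribute names are hypothetical.

```python
def is_idi(model, instance, protected_attr, values):
    """Return True if `instance` is an Individual Discriminatory Instance:
    varying only the protected attribute changes the model's decision."""
    decisions = set()
    for v in values:
        variant = dict(instance, **{protected_attr: v})  # flip protected attr only
        decisions.add(model(variant))
    return len(decisions) > 1  # more than one decision => discrimination

# Hypothetical biased model: approves on income, but penalizes age >= 60.
def toy_model(x):
    return int(x["income"] > 50 and not (x["age"] >= 60))

applicant = {"income": 80, "age": 30}
print(is_idi(toy_model, applicant, "age", [30, 65]))  # True: age alone flips it
```

Fairness-testing tools search the input space for such instances; the quality of the initial seeds fed to that search is exactly what the paper's I&D approach targets.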